ABSTRACT


Clinical trials are an important concept in health and medicine: they are studies
that investigate the safest ways to treat patients. Clinical trial protocols are
usually written as unstructured free text, which must be processed before a
computer can interpret it. The aim of this paper is to classify the text of cancer
clinical trial documents, which consist of unstructured free text taken from
cancer clinical trial protocols. The proposed approach combines conditional
random fields (CRF) with bigram features. The new classification model for
cancer clinical trial text is compared with other methods in terms of precision,
recall, and F-1 score. The results of this study are better than the previous
results, namely 88.07 precision, 88.05 recall, and 88.06 F-1 score.

1. INTRODUCTION

Text classification is defined as labeling natural language text documents with classes or categories
from a predetermined set [1, 2]. Text classification is an important component in many NLP applications, such as
sentiment analysis [3], relation extraction [4], and spam detection [5, 6]. It has also attracted the attention
of researchers who continue to develop and test new approaches, including for cancer clinical texts, commonly
referred to as cancer clinical trials. Clinical trials are research studies that involve people. Through clinical
trials, medical practitioners find new ways to improve treatments and the quality of life for people with
illness [7].

Classification of clinical trial text has been developed through several approaches, such as statistical
approaches [8], eligibility screening [9], machine learning [10], deep neural networks [11, 12],
clustering [13], convolutional neural networks [14], and fine-grained document clustering [12]. It appears
that the use of deep learning methods in modeling clinical trial classification is still limited, and the
classification of clinical trial texts is still under development. Deep learning methods offer great hope for
solving clinical trial problems, especially in modeling and in improving performance and computation [14].
However, because text data is high-dimensional and sparse and the semantics of natural language are complex,
text classification presents its own challenges [15].

Currently, deep learning technology [16] has achieved extraordinary results in many areas, such as
computer vision [17], speech recognition [18], and text classification [19]. Vincent Menger [20] states that in
some cases deep learning techniques applied to the classification of clinical texts can produce conclusions
that match expectations, but the results may differ when tested on other clinical datasets from different
domains and of different sizes. One study on clinical trials that attracts attention is that of Bustos and
Pertusa [21]. That study predicts, based on medical records from doctors, whether a patient with cancer is
eligible or not eligible.

The methods used in that research are k-nearest neighbor (KNN), support vector machine (SVM),
convolutional neural network (CNN), and FastText. The reported results state that KNN performs better than
complex models such as SVM, CNN, and FastText. However, KNN still takes a long time to evaluate on this
dataset, which means that the method is effective but not efficient.

One method that can improve the computational performance of text classification is conditional random
fields [22, 23]. A conditional random field (CRF) is a standard probabilistic model for segmenting and labeling
sequence data: it predicts the most likely sequence of labels that corresponds to a sequence of inputs. The
purpose of this study is to improve the text classification of cancer clinical trials using the CRF method. CRF
is used to combine features into a model that assesses the level of importance of sentences [24]. One of the
applications of CRF is text classification [25, 26].

Several studies are related to this theme, such as cancer classification [27, 28] and clinical text
classification [29, 30]. Other research on clinical text classification includes Zhang and Fushman [10], who
developed methods for automatic classification to facilitate matching patients to trials in the
ClinicalTrials.gov dataset for specific populations, such as people living with HIV or pregnant women.
Yao et al. [31] produced a clinical text classification on obesity using rule-based features and CNN methods.
Chuan [14] proposes an active deep learning approach to automatically classify clinical trials. Jasmir et
al. [12] use a DNN and fine-grained document clustering to improve the classification of clinical trial text.

The conditional random fields method has also been applied to several tasks, such as extracting
causal relations from emergency [32], named entity recognition [33], and biomedical text [24]. Moharasan and
Ho [34] conducted research on clinical texts with semi-supervised conditional random fields. They proposed
and evaluated a novel two-stage semi-supervised approach for temporal event extraction, and showed that
using unlabeled clinical texts significantly increases the accuracy of temporal event extraction. However,
the approach still needs to be developed with additional methods and features. Therefore, to address this
problem, we built a new model that uses conditional random fields (CRF) and bigram features to improve
computational performance.


2. MATERIAL AND METHOD 

The CRF-based model is shown in Figure 1. The dataset is parsed into tags and text, followed by
preprocessing steps such as tokenization, bigram feature creation, and sentence detection. The next step is
training with the CRF, which produces a CRF model. At the same time, the tagging process is also carried
out. The trained model then labels the text according to its features.


2.1. Dataset 

Data were taken from clinical reports, extracted from cancer clinical trial protocols provided by the
National Institutes of Health, Bethesda, MD, USA, which can be downloaded from
https://clinicaltrials.gov. The data come from the intervention, condition, and eligibility fields, which are
written in unstructured free text. Information in the eligibility criteria is a series of phrases and/or
sentences displayed in free format, such as paragraphs, bulleted lists, and enumerated lists.


2.2. Preprocessing 

Preprocessing plays a very important role in text mining techniques and applications, and it is the first
step in the text mining process. In this paper, we discuss three main preprocessing steps, namely stop word
handling, stemming, and TF/IDF [35]. All eligibility criteria are converted into simple word sequences. Based on
this data, mapping is carried out into the components of the patient's complaints (main complaints, onset, other
complaints, information, frequency of attacks, nature of attacks, duration, location, course of disease, previous
treatment history, and the consequences of disturbances that arose). All words are converted to lowercase. At
this stage, information extraction is carried out, in which unnecessary words are deleted so that the final result
required for classification is obtained. However, we do not remove stop words, because they are semantically
relevant to clinical statements. Finally, numbers, arithmetic signs, and comparison operators are replaced with text.
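
The normalization steps described in this section (lowercasing, keeping stop words, and replacing numbers and
symbols with text) can be illustrated with a minimal Python sketch. The symbol-to-word mapping, the `num`
placeholder, and the function name `preprocess` are assumptions made for illustration only; the paper does not
list the exact replacements it uses.

```python
import re
from typing import List

# Hypothetical symbol-to-text mapping (an assumption, not taken from the paper).
SYMBOL_WORDS = {
    ">=": " greater_equal ",
    "<=": " less_equal ",
    ">": " greater_than ",
    "<": " less_than ",
    "=": " equal ",
    "+": " plus ",
    "%": " percent ",
}

# Longest symbols first so that ">=" is not split into ">" and "=".
_SYMBOL_RE = re.compile(
    "|".join(re.escape(s) for s in sorted(SYMBOL_WORDS, key=len, reverse=True))
)
_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def preprocess(criterion: str) -> List[str]:
    """Turn one eligibility criterion into a simple lowercase word sequence."""
    text = criterion.lower()
    # Replace every number with a textual placeholder.
    text = _NUMBER_RE.sub(" num ", text)
    # Replace arithmetic and comparison symbols with words.
    text = _SYMBOL_RE.sub(lambda m: SYMBOL_WORDS[m.group(0)], text)
    # Keep alphabetic tokens; stop words are deliberately not removed.
    return re.findall(r"[a-z_]+", text)
```

For example, `preprocess("Age >= 18 years")` yields the tokens `age`, `greater_equal`, `num`, `years`, so the
comparison survives as a word that a later feature extractor can use.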

Figure 1. Conceptual model


2.3. Bigram 

In general, a bigram or digram is a sequence of two adjacent elements of a token string, where the
elements are usually letters, syllables, or words. A bigram is an n-gram with n = 2. The frequency distribution
of the bigrams in a string (or data type) is commonly used for simple statistical analysis of text [36, 37]. In
this study, however, a bigram refers to a two-word phrase that occurs frequently in the medical field;
frequently occurring bigrams are then converted into text form. A bigram can also take the form of an
expression. Bigrams also give the conditional probability of a token given the preceding token:


$$P(w_n \mid w_{n-1}) = \frac{P(w_{n-1}, w_n)}{P(w_{n-1})} \tag{1}$$

That is, the probability of a token w_n given the preceding token w_{n-1} is equal to the probability of their
bigram, or the co-occurrence of the two tokens P(w_{n-1}, w_n), divided by the probability of the preceding token.
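
As an illustration of Equation (1), the conditional probability can be estimated from raw counts, since the ratio
of the two probabilities reduces to count(w_{n-1}, w_n) / count(w_{n-1}). The following Python sketch assumes a
single tokenized text; the function name `bigram_conditional_probs` is hypothetical and is not part of the
proposed system.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bigram_conditional_probs(tokens: List[str]) -> Dict[Tuple[str, str], float]:
    """Estimate P(w_n | w_{n-1}) as count(w_{n-1}, w_n) / count(w_{n-1})."""
    # Count every pair of adjacent tokens.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it is the first element of a bigram,
    # so numerator and denominator are normalized over the same positions.
    prev_counts = Counter(tokens[:-1])
    return {
        (prev, cur): count / prev_counts[prev]
        for (prev, cur), count in bigram_counts.items()
    }
```

With the tokens `breast cancer stage breast cancer patient`, the token `breast` is always followed by `cancer`,
so P(cancer | breast) = 1.0, while `cancer` is followed once by `stage` and once by `patient`, giving 0.5 each.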


2.4. Conditional random fields 

Conditional random fields (CRF) is a probabilistic model that is widely used for segmenting and
labeling sequence data. CRF combines ideas from the hidden Markov model (HMM) and the maximum entropy
Markov model (MEMM) [38], keeping the advantages of both methods while avoiding their disadvantages. It
gains the advantages of discriminative modeling that generative models lack, overcomes shortcomings of
generative models such as the strong independence assumptions of the HMM, and avoids the label bias problem
that occurs in the MEMM.

The set of features used to build the model with the CRF-based approach includes the word and its context
words, the part of speech (PoS) of the word and of its context words, and prefix and suffix information. The
CRF is added to the model in order to add constraint relationships between the labels, to ensure that the
predicted labels are valid, and to find an optimal label sequence [39].
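
A minimal Python sketch of how such a feature set could be assembled for each token is given here. It assumes
that PoS tags are already available from some tagger, and the feature names, the three-character prefix and
suffix length, and the function names are hypothetical choices for illustration rather than the configuration
used in this study.

```python
from typing import Dict, List


def token_features(words: List[str], pos_tags: List[str], i: int) -> Dict[str, str]:
    """Build features for token i: word, context words, PoS tags, prefix, suffix, bigram."""
    word = words[i]
    feats = {
        "word": word,
        "pos": pos_tags[i],
        "prefix3": word[:3],   # prefix information (length 3 is an assumption)
        "suffix3": word[-3:],  # suffix information (length 3 is an assumption)
    }
    if i > 0:
        feats["prev_word"] = words[i - 1]
        feats["prev_pos"] = pos_tags[i - 1]
        # Bigram feature: the preceding word joined with the current word.
        feats["bigram"] = words[i - 1] + " " + word
    else:
        feats["BOS"] = "1"  # marks the beginning of the sentence
    if i < len(words) - 1:
        feats["next_word"] = words[i + 1]
        feats["next_pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = "1"  # marks the end of the sentence
    return feats


def sentence_features(words: List[str], pos_tags: List[str]) -> List[Dict[str, str]]:
    """Return one feature dictionary per token in the sentence."""
    return [token_features(words, pos_tags, i) for i in range(len(words))]
```

Removing the `bigram` entry from the dictionary gives the corresponding feature set without bigrams, which is
one simple way to set up a with/without comparison.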


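$$s(X, Y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i} \tag{2}$$

$$P(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}} \tag{3}$$

$$Y^{*} = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y}) \tag{4}$$

Here X is the input sequence, Y is a label sequence, A holds the transition scores between labels, P_{i, y_i} is
the score of label y_i at position i, and Y_X is the set of all possible label sequences for X.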
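
To make the scoring, normalization, and argmax steps concrete, the following Python sketch scores a label
sequence, normalizes it by brute force over all sequences, and finds the best sequence with the Viterbi
algorithm. It is a sketch under stated assumptions: per-position label scores (`emissions`) and label-to-label
scores (`transitions`) are given as plain lists, start and end transitions are omitted for brevity, and the
input is non-empty. The function names are hypothetical, and Viterbi is the standard decoding technique for
this argmax rather than something the paper specifies.

```python
import math
from itertools import product
from typing import List, Sequence


def sequence_score(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Score s(X, Y): sum of per-position label scores plus label transition scores."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def sequence_probability(emissions, transitions, labels) -> float:
    """P(Y | X): exp(score) normalized over all label sequences (brute force, small inputs only)."""
    n_labels = len(transitions)
    all_seqs = product(range(n_labels), repeat=len(emissions))
    z = sum(math.exp(sequence_score(emissions, transitions, s)) for s in all_seqs)
    return math.exp(sequence_score(emissions, transitions, labels)) / z


def viterbi_decode(emissions, transitions) -> List[int]:
    """Return the label sequence with the highest score using dynamic programming."""
    n_labels = len(transitions)
    best = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for row in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(n_labels):
            # Best previous label for a path that ends in `cur` at this position.
            prev = max(range(n_labels), key=lambda p: best[p] + transitions[p][cur])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][cur] + row[cur])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

The brute-force normalization is exponential in the sequence length and is shown only to mirror the formula;
practical CRF implementations compute the normalizer with the forward algorithm instead.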


3. RESULTS AND ANALYSIS 
3.1. Precision, recall, and F-1 score

Measures of classification performance can be defined based on the confusion matrix [40], as shown in
Table 1. The confusion matrix compares the classification results produced by the system (model) with the
actual classes. It takes the form of a table that describes the performance of the classification model on a
set of test data whose true values are known. Precision is the proportion of items predicted as positive that
are actually positive:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{5}$$

Recall is a measure of the success of a system in finding and retrieving information, that is, the
proportion of actual positive items that the system identifies:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{6}$$

F-Measure, or F1-score, is an evaluation measure from information retrieval that combines recall and
precision. Recall and precision can carry different weights in a given situation; the F-Measure expresses the
trade-off between them as their weighted harmonic mean. With equal weights it becomes:

$$\text{F1 Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{7}$$


Table 1. Measures of classification performance

                      Actual class
Predicted class       Class=Yes               Class=No
Class=Yes             True positive (TP)      False positive (FP)
Class=No              False negative (FN)     True negative (TN)
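
The three measures can be computed directly from the confusion matrix counts. The short Python sketch below
uses the standard definitions, with recall taken over the actual positive class (TP + FN); the function name
and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, and F1 from confusion matrix counts."""
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Because F1 is a harmonic mean, it always lies between precision and recall; for instance, a precision of 88.07
and a recall of 88.05 give an F1 of about 88.06.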


3.2. Model evaluation and validation 

In this study we conducted two experiments, each on data sizes from 1×10^5 to 5×10^5 data: the first
experiment is CRF without bigram and the second is CRF with bigram. For small data, the difference between
using and not using bigrams is not significant, but the effect becomes visible on larger data. Adding the
bigram feature to conditional random fields therefore improves the evaluation values, especially for larger
data.


3.2.1. CRF without bigram 

The initial stage is evaluation using the conditional random field method only. Table 2 and
Figure 2 show the CRF results without bigram. Data processing starts from the smallest set of 100,000 data
and increases in steps of 100,000 up to the largest set of 500,000 data. The evaluation values tend to
increase as data is added: the more data that is processed, the greater the evaluation values.


3.2.2. CRF with bigram 

The next stage is evaluation using the CRF method with bigram features, as shown in Table 3 and
Figure 3. The bigram feature greatly influences the evaluation results. As with CRF without bigram, the
evaluation values tend to increase as data is added, but the results are better than those of CRF without
bigram, especially on larger data (500,000 data).


3.2.3. Comparison table 

Table 4 and Figure 4 compare our results with the research of Bustos and Pertusa [21] (same dataset,
different methods) and Moharasan and Ho [34] (different dataset, same method). Our method achieves a
precision of 88.07, a recall of 88.05, and an F-1 score of 88.06. Compared with Bustos and Pertusa [21], who
used the same dataset with deep learning methods such as CNN and FastText, the CRF method gives good results
and a small increase in the evaluation values. Compared with Moharasan and Ho [34], who used CRF on a
different dataset, our results are also better, especially with the bigram feature.

The model works better because the context learned specifically from clinical trial text documents
contains less ambiguity than a general context in the classification process. Another reason is that CRF is a
statistical model: the higher the ratio of causal sentences that appear, the more comprehensive the
statistical information, and the higher the evaluation score.


Table 2. Result of CRF without bigram

Data size         100,000   200,000   300,000   400,000   500,000
Precision (%)     70.1      75.1      78.02     80.03     84.01
Recall (%)        72.9      74.7      77.35     79.67     84.3
F-1 Score (%)     71.4      74.9      77.68     79.84     84.15


Table 3. Result of CRF with bigram

Data size         100,000   200,000   300,000   400,000   500,000
Precision (%)     68.6      73.6      79.05     83.07     88.07
Recall (%)        71.7      73.6      78.01     82.72     88.05
F-1 Score (%)     70.12     73.60     78.53     82.89     88.06


[Bar charts of precision, recall, and F-1 score (%) for 100,000 to 500,000 data]
Figure 2. Graph of CRF without bigram
Figure 3. Graph of CRF with bigram


Table 4. Result of comparison with Bustos and Pertusa [21] and Moharasan and Ho [34]

Method                                Precision (%)   Recall (%)   F-1 (%)
FastText [21]                         88.00           86.00        87.00
CNN [21]                              88.00           88.00        88.00
SVM [21]                              79.00           79.00        79.00
Baseline [34]                         79.17           81.12        80.13
CRF + Lexical + Syntactic [34]        85.27           72.23        78.39
CRF + Random Selection [34]           86.42           82.25        84.21
Proposed method: CRF with bigram      88.07           88.05        88.06


[Bar chart of precision, recall, and F-1 (%) for each method in Table 4]
Figure 4. Result of comparison with Bustos [21] and Moharasan [34]


4. CONCLUSION 

CRFs have been successfully applied to clinical trial text classification. The evaluation results indicate
that clinical trial text documents, which are freely available, can be exploited by conditional random fields,
opening the potential to explore more ambitious goals given the additional effort needed to build
corresponding datasets. The results of this study are better than the previous results, namely 88.07
precision, 88.05 recall, and 88.06 F-1 score. Future research will address multilabel classification, in which
the classes are "effective" vs. "ineffective" and "studied" vs. "not studied", and both can be true or false.
This will allow us to classify four types of cases: effective and studied, potentially effective but not
studied, ineffective and studied, and potentially ineffective and not studied. The main effort in this case
lies in building a dataset that includes the efficacy results obtained for each study. New models can then be
developed to suggest potential cancer treatments that can be considered for particular patient cases based on
the efficacy of completed clinical trials.